ABSTRACT

Summary: The MolClass toolkit and data portal generate 
computational models from user-defined small molecule datasets 
based on structural features identified in hit and non-hit molecules
in different screens. Each new model is applied to all datasets in the 
database to classify compound specificity. MolClass thus defines 
a likelihood value for each compound entry and creates an activity 
fingerprint across diverse sets of screens. MolClass uses a variety 
of machine-learning methods to find molecular patterns and can 
therefore also assign a priori predictions of bioactivities for previously 
untested molecules. The power of the MolClass resource will grow 
as a function of the number of screens deposited in the database. 
Availability and implementation: The MolClass webportal, software 
package and source code are freely available for non-commercial use 
at http://tyerslab.bio.ed.ac.uk/molclass. A MolClass tutorial and a
guide on how to build models from datasets can also be found on the 
web site. MolClass uses the Chemistry Development Kit (CDK), WEKA
and MySQL for its core functionality. A REST service is available 
at http://tyerslab.bio.ed.ac.uk/molclass/api based on the OpenTox 
API 1.2. 
1 INTRODUCTION

Bioactive molecules can serve as powerful tools for interrogation 
of biological systems and/or as precursors in drug discovery. An 
objective in chemical systems biology is to model biological systems 
in order to understand the effects of small molecules on cellular
processes, and thereby explain the basis for small molecule action
(Hopkins, 2008). Realization of this ambitious goal will require
extensive experimental datasets. The generation of chemical datasets 
from biological screening assays is usually limited by cost and 
throughput. Pharmaceutical companies and academic groups use 
high-throughput screens to test large libraries of small molecules that 
elicit a desired biological response, typically against a single target or 
at most a few related targets. However, chemical space is estimated 
to contain on the order of 10^60 molecular entities, which greatly
exceeds even the multi-million compound libraries at the disposal 
of large pharmaceutical companies (Dobson, 2004). This vastness of
chemical space requires that researchers devise rational approaches 
for identifying small bioactive molecules, particularly given the 
severe resource constraints on academic screening initiatives. The 
computational evaluation of potential bioactive molecules can drive 
down the high cost of screens and help extract potential drug-like 
compounds from pre-existing data in the public domain. To enable 
the extraction of information from existing chemical screen data, 
we have developed a suite of machine-learning tools that
statistically rank each compound for any given assay in a user- 
defined database. MolClass will thus facilitate the identification of 
specific bioactive molecules and allow the prediction of moieties 
that underpin biological activity. 

2 WORKFLOW FEATURES 

Existing resources for chemical screen data, notably PubChem, 
ChEMBL and ChemBank, are passive repositories that house an 
incomplete matrix of small molecule activity across submitted 
screens of various types, ranging from in vitro binding and enzyme 
assays to complex cellular and whole organism phenotypic assays 
(Fig. 1A). To interrogate such data, MolClass generates a complete
matrix of compound activities across many screens and thereby 
enables functional predictions for all molecules, even if not tested in 
a specific screen. The user can upload input datasets of up to 20 000 
molecules in SDF file format, in which tags distinguish hit from 
non-hit compounds in one or several screens. MolClass combines 
the datasets to generate a computational model for each screen 
submitted (Fig. 1B). These models are then applied to all molecules
stored in MolClass to predict activity. MolClass currently provides
either a composite of all 2D molecular descriptors (2529 bits) or
any of its component sets chosen individually: 152 property descriptors,
MACCS (166 bits), Substructure (306 bits), CDK extended (1024 bits)
or PubChem (881 bits) fingerprints. As different machine-learning
algorithms tend to generate slightly different likelihood values, a 
variety of algorithms are provided in MolClass including Random 
Forest, Naive Bayes, SVM, KNN, Logistic Model Tree and J48. The
user can apply one or several algorithms to any dataset of interest. 
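
As a rough illustration of the tagged SDF input described earlier in this section, the following Python sketch reads the data items of each record and separates hits from non-hits. The tag name `ACTIVITY`, the values `hit` and `non-hit`, and the placeholder records are assumptions made for this example only; MolClass may expect different tag names and values.

```python
from typing import Dict, List, Tuple

# Placeholder records: empty structure blocks and made-up tag values.
SAMPLE_SDF = """compound_A
  placeholder

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <NAME>
compound_A

> <ACTIVITY>
non-hit

$$$$
compound_B
  placeholder

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <NAME>
compound_B

> <ACTIVITY>
hit

$$$$
"""


def read_sdf_tags(sdf_text: str) -> List[Dict[str, str]]:
    """Return one {tag: value} dict per SDF record; structure blocks are skipped."""
    records = []
    for block in sdf_text.split("$$$$"):  # "$$$$" terminates each record
        if not block.strip():
            continue
        lines = block.splitlines()
        tags: Dict[str, str] = {}
        i = 0
        while i < len(lines):
            line = lines[i]
            # A data item header looks like "> <TAG>"; its value runs to the next blank line.
            if line.startswith(">") and "<" in line and line.rstrip().endswith(">"):
                name = line[line.index("<") + 1 : line.rindex(">")]
                values = []
                i += 1
                while i < len(lines) and lines[i].strip():
                    values.append(lines[i].strip())
                    i += 1
                tags[name] = "\n".join(values)
            i += 1
        records.append(tags)
    return records


def split_hits(
    records: List[Dict[str, str]], tag: str = "ACTIVITY", hit_value: str = "hit"
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Split records into (hits, non-hits) by a hypothetical activity tag."""
    hits = [r for r in records if r.get(tag, "").lower() == hit_value]
    non_hits = [r for r in records if tag in r and r[tag].lower() != hit_value]
    return hits, non_hits


records = read_sdf_tags(SAMPLE_SDF)
hits, non_hits = split_hits(records)
print(len(hits), "hit(s),", len(non_hits), "non-hit(s)")
```

Records that lack the activity tag altogether fall into neither list, which keeps untested compounds out of the training set in this sketch.
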
Unbalanced datasets are boosted using SMOTE (Nitesh et al., 2002),
at most doubling the size of the smaller class, and, if they
exceed a 1:5 ratio of active to inactive compounds, are further adjusted
using the WEKA under-sampling method. All models in MolClass
are then applied to these molecules to generate activity fingerprints. 
For training and testing, MolClass uses 10-fold cross-validation. 
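
The class-balancing rule above can be sketched as simple arithmetic on class sizes. This is one possible reading, not the MolClass implementation: it assumes that SMOTE boosting stops once the smaller class is doubled or the classes are equal, and that under-sampling trims the larger class to exactly five times the boosted smaller class. The function name `balanced_sizes` and the parameter `max_ratio` are hypothetical, and no synthetic samples are actually generated here.

```python
from typing import Tuple


def balanced_sizes(n_minority: int, n_majority: int, max_ratio: int = 5) -> Tuple[int, int]:
    """Return (minority, majority) class sizes after boosting and under-sampling.

    Assumed reading of the text: oversample the smaller class to at most twice
    its size, then cap the larger class at max_ratio times the boosted size.
    """
    if n_minority <= 0 or n_minority > n_majority:
        raise ValueError("expected 0 < n_minority <= n_majority")
    # Oversampling step: at most double, and never overshoot the larger class.
    boosted = min(2 * n_minority, n_majority)
    # Under-sampling step: only applies if the ratio still exceeds 1:max_ratio.
    kept = min(n_majority, max_ratio * boosted)
    return boosted, kept


for minority, majority in [(100, 2000), (100, 300), (400, 600)]:
    print(minority, majority, "->", balanced_sizes(minority, majority))
```

Under these assumptions a screen with 100 hits and 2000 non-hits would be trained on 200 and 1000 compounds, whereas a screen with 100 hits and 300 non-hits would only have its hits boosted.
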
Fig. 1. MolClass features. (A) Current state of data from public
resources such as PubChem and ChemBank. (B) MolClass workflow
from experimental data to activity likelihoods. (C) Likelihood scores
for fenbendazole and aspirin in 14 different models: (1) neurosphere
proliferation, +/none (Diamandis et al., 2007); (2) Caco-2 permeation, +/-
(Hou et al., 2004); (3) fluconazole synergizer, +/none (Spitzer et al., 2011);
(4) Caenorhabditis elegans drug bioaccumulation, none/+ (Burns et al.,
2010); (5) Ames mutagenicity benchmark, none/+ (Hansen et al., 2009);
(6) mutagenicity prediction, +/none (Kazius et al., 2005); (7) blood-brain
barrier penetration, +/- (Li et al., 2005); (8) PubChem AID 1828 +/none;
(9) PubChem AID 595 +/-; (10) ChemBank 1000423 +/-; (11) ChemBank
1001644 +/-; (12) ChemBank 1000359 +/-; (13) autofluorescence none/+
and (14) ChEMBL TargetID CHEMBL204 none/+. '+' activating, '-'
inhibiting and 'none' no effect.



The user can examine the model statistics, the likelihood scores
for screens of interest and, as shown in Figure 1C, single molecule
likelihood fingerprints for existing models. Finally, MolClass also
enables a substructure search using the JME Editor in the event a
molecule of interest is not present in the database.
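
A single-molecule likelihood fingerprint of the kind described in this section can be pictured as one likelihood value per model. The sketch below is illustrative only: the toy models, the screen names and the four-bit descriptor vector are hypothetical stand-ins for trained classifiers and real fingerprints, and none of them come from MolClass.

```python
from typing import Callable, Dict, List, Sequence

# A model maps a descriptor bit vector to a likelihood between 0 and 1.
Model = Callable[[Sequence[int]], float]


def make_toy_model(positions: List[int]) -> Model:
    """Hypothetical stand-in for a trained classifier: the fraction of set bits
    among the given descriptor positions."""
    def model(bits: Sequence[int]) -> float:
        return sum(bits[p] for p in positions) / len(positions)
    return model


def activity_fingerprint(bits: Sequence[int], models: Dict[str, Model]) -> Dict[str, float]:
    """Apply every model to one molecule and collect the likelihoods."""
    return {name: model(bits) for name, model in models.items()}


# Made-up screens and descriptor bits, for illustration only.
models = {
    "screen_1": make_toy_model([0, 1]),
    "screen_2": make_toy_model([2, 3]),
    "screen_3": make_toy_model([0, 3]),
}
fingerprint = activity_fingerprint([1, 1, 0, 0], models)
for screen, likelihood in fingerprint.items():
    print(screen, likelihood)
```

Because every stored molecule is scored by every model, stacking these per-molecule fingerprints gives a complete compound-by-screen matrix, in contrast to the sparse matrix of experimentally tested pairs.
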

3 CONCLUSION 

MolClass provides a comprehensive overview of compound activity 
in different screens. These profiles can reveal promiscuous activities 
across several screens, which may reflect undesirable off-target 
effects. For experimental datasets, the user can discover
structure-activity relationships because similar structures and activities will
lead to specific likelihood patterns. As the data collection is 
expanded by users to different biological responses and assay 
formats, the classification power of the portal will increase, and 
thereby facilitate chemical systems biology. 

4 IMPLEMENTATION 

MolClass is implemented in Java and Perl using CDK (Steinbeck
et al., 2003), WEKA (Hall et al., 2009) and moldb4 (Haider, 2010).
The web interface and REST service are written in PHP5, Slim and 
PEAR and run on a Fedora Linux 8 server, as an Apache HTTP 
service. The data are stored in a MySQL 5.5 database running on a 
separate Fedora Linux 16 server. 